ABSTRACT

Non-canonical guanine quadruplex structures are 
not only predominant but also conserved among 
bacterial and mammalian promoters. Moreover 
recent findings directly implicate quadruplex struc- 
tures in transcription. These argue for an intrinsic 
role of the structural motif and thereby posit that 
single nucleotide polymorphisms (SNP) that com- 
promise the quadruplex architecture could influence 
function. To test this, we analysed SNPs within 
quadruplex motifs (Quad-SNP) and gene expression 
in 270 individuals across four populations (HapMap) 
representing more than 14500 genotypes. Findings 
reveal significant association between quadruplex- 
SNPs and expression of the corresponding gene in 
individuals (P< 0.0001). Furthermore, analysis of 
Quad-SNPs obtained from population-scale sequen- 
cing of 1000 human genomes showed relative selec- 
tion bias against alteration of the structural motif. 
To directly test the quadruplex-SNP-transcription
connection, we constructed a reporter system
using the RPS3 promoter: a remarkable difference
in promoter activity between the 'quadruplex-destabilized'
and 'quadruplex-intact' promoters was observed.
As a further test, we incorporated a quadruplex
motif or its disrupted counterpart within a synthetic
promoter reporter construct. The quadruplex motif,
and not the disrupted motif, enhanced transcription
in human cell lines of different origin. Together,
these findings build direct support for
quadruplex-mediated transcription and suggest that
quadruplex-SNPs may play a significant role in
mechanistically understanding variations in gene
expression among individuals.



INTRODUCTION 

In addition to the canonical B-DNA structure, DNA can
adopt local secondary structure conformations.
Non-canonical DNA structures have been implicated in
important biological functions including replication,
recombination and transcription (1). DNA secondary
structures have also been associated with translocations
and mutations that cause genome instability (1,2). This
raises the intriguing possibility that locally formed DNA
structure influences intrinsic cellular functions. In this
context, it is interesting to consider the non-canonical
secondary structure adopted by guanine-rich DNA sequences
called the G-quadruplex or G4 DNA. Accumulating evidence
indicates involvement of G-quadruplex motifs in chromatin
packaging (3-5), recombination (6) and CpG methylation
(7), in addition to gene transcription, which is the most
studied.

G-quadruplex motifs are non-canonical Hoogsteen
base-paired self-assemblies of DNA strands in
parallel/antiparallel orientation stabilized by charge
coordination with monovalent cations (especially K+)
(8-11). Initially observed to be enriched in bacterial
promoters (12,13), potential G4 (PG4) motifs were
subsequently found to be prevalent in human (14,15),
chimpanzee (15), mouse (15), rat (15) and chicken (16)
promoters. Furthermore, hundreds of PG4 motifs appear to
be conserved among human, mouse and rat promoters (15).
In vitro, c-MYC was the first case where a
G-quadruplex-forming sequence in the nuclease
hypersensitive element upstream of the P1 promoter was
shown to affect transcription (17). Gene expression was
also found to be influenced by G-quadruplex-forming
sequence motifs within the core promoter of the human
c-KIT (18,19) and k-RAS (20) oncogenes. In addition,
promoter G-quadruplex motifs have been reported for many
genes, including VEGF, PDGF, HIF1α, BCL-2, RB, RET
(21,22) and human telomerase hTERT (23,24). In the case
of thymidine kinase 1, we found a non-canonical
G-quadruplex motif, formed by two-guanine repeats instead
of three, to be functionally active (25).

More direct evidence in support of G-quadruplex-mediated
transcription was obtained from chromatin
immunoprecipitation (ChIP) experiments demonstrating
that the non-metastatic factor NM23-H2 associates
with the c-MYC promoter through a G-quadruplex
motif (26). In addition, support for this mode of
transcription was obtained from: interaction of
recombinant hnRNP A1/Up1 with the KRAS promoter
G-quadruplex (27); Myc-associated zinc finger protein
(MAZ)/poly(ADP-ribose) polymerase 1 (PARP-1) binding to
the G-quadruplex element in the murine KRAS promoter
(28); and binding of nucleolin/hnRNP proteins to the
G-quadruplex-forming sequences of the VEGF promoter
(29). Furthermore, similar motifs in the promoters of
human sarcomeric mitochondrial creatine kinase, muscle
creatine kinase and integrin alpha7 of mouse were shown
to associate with the dimeric form of MyoD in vitro
(30,31). Consistent with these findings, transcriptome
profiling in the presence of intracellular
G-quadruplex-binding ligands indicated a genome-wide
role of G-quadruplex motifs in transcription (32).

Taken together, emerging computational/experimental
evidence supporting G-quadruplex-mediated transcriptional
functions raises an interesting question: can single
nucleotide polymorphisms (SNP) that affect
stability/formation of the quadruplex structure influence
transcription, resulting in individual-specific gene
expression change? This possibility has not been tested,
although an independent line of study showed that SNPs
that can potentially disrupt PG4 motifs were less
frequent than expected (33). Using matched genotype
information, SNP data from the HapMap consortium (34) and
gene expression profiles of the respective individuals
(from lymphoblastoid-derived cell lines) (35), we asked
whether difference in expression of a particular gene is
associated with SNPs that disrupt PG4 motif(s) (Scheme 1).
This was tested using human SNPs and the chimpanzee as an
ancestral genome for comparison (36). Findings were
verified using experimental gene expression reporter
models that directly assayed the effect of G-quadruplex
disruption on gene transcription in human cell lines.

METHODS 

Analysis of HapMap data 

Genotype information was extracted from the HapMap data
repository (HapMap Release 21). Data for significant
correlation with gene expression (from lymphoblastoid
cell lines of the HapMap individuals) were used as
reported by Stranger et al. (35), who analysed 2.2
million SNPs with 13 643 distinct gene probes and found
1348 genes where the SNP was significantly associated
with expression (cis-eQTL) of the respective gene in at
least one of the four HapMap populations. We mapped the
cis-eQTLs to PG4 motifs using an in-house PERL algorithm.
For each Quad-SNP, the average gene expression [using
data from the http://www.ncbi.nlm.nih.gov/geo database
(GSE6536)] of all individuals of a genotype was
calculated and denoted as the expression value for that
particular genotype, and these data were plotted as a
heat map to accommodate all the SNPs and genotypes
studied.
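
The per-genotype averaging described above can be sketched as follows. This is a minimal Python illustration, not the in-house PERL program used in the study; the function name, the input dictionaries and the toy values are hypothetical.

```python
from collections import defaultdict
from statistics import mean

def mean_expression_by_genotype(genotypes, expression):
    """Average expression over all individuals sharing a genotype.

    genotypes:  dict mapping individual ID -> genotype string (e.g. "GG", "GC")
    expression: dict mapping individual ID -> expression value for one gene
    """
    groups = defaultdict(list)
    for individual, genotype in genotypes.items():
        if individual in expression:  # skip individuals without expression data
            # Sort alleles so that "GC" and "CG" count as the same heterozygote.
            groups["".join(sorted(genotype))].append(expression[individual])
    return {g: mean(values) for g, values in groups.items()}

# Hypothetical toy data for one Quad-SNP (IDs and values are made up).
genotypes = {"ind1": "GG", "ind2": "GC", "ind3": "CG", "ind4": "CC", "ind5": "GG"}
expression = {"ind1": 8.0, "ind2": 7.0, "ind3": 7.4, "ind4": 6.5, "ind5": 8.4}
print(mean_expression_by_genotype(genotypes, expression))
```

One row of the heat map would then correspond to the three values returned for a single Quad-SNP, with the heterozygous genotype placed between the two homozygous ones.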

PG4 motif sequence retrieval and analysis 

PG4 motifs were identified as described earlier (12).
Briefly, we adopted a general pattern G3-L1-G3-L2-
G3-L3-G3, where G is guanine and L is any nucleotide
including G. The PG4 loops (L1, L2 and L3) could vary
from one to seven bases. The program was rerun with
cytosine instead of guanine to identify motifs on the
complementary strand and was corrected for strand
orientation with respect to positioning in the gene.
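
The pattern G3-L1-G3-L2-G3-L3-G3 with loops of one to seven bases maps naturally onto a regular expression. The sketch below is an illustrative Python reimplementation, not the program used in the study; the function names are hypothetical, one match is reported per start position, and the correction for gene orientation mentioned above is not included.

```python
import re

LOOP = "[ACGT]{1,7}"  # loops L1-L3: one to seven bases, any nucleotide including G

def pg4_regex(base="G"):
    """Regex for four runs of three `base` separated by three loops.

    The pattern sits inside a lookahead so that overlapping motifs are found.
    """
    run = base + "{3}"
    return re.compile("(?=(" + LOOP.join([run] * 4) + "))")

def find_pg4(sequence):
    """Return sorted (start, end, strand, motif) tuples, 0-based, end exclusive."""
    sequence = sequence.upper()
    hits = []
    # Scanning for C-runs finds motifs located on the complementary strand.
    for strand, base in (("+", "G"), ("-", "C")):
        for match in pg4_regex(base).finditer(sequence):
            motif = match.group(1)
            hits.append((match.start(), match.start() + len(motif), strand, motif))
    return sorted(hits)

seq = "ATGGGTGGGTGGGTGGGATCCCACCCACCCACCCTT"
for hit in find_pg4(seq):
    print(hit)
```

With these loop limits a motif spans 15 to 33 nt, which matches the length range used for the random control sequences in the next paragraph.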

Sequences along with annotation of TSS for 18 056
unique human RefSeq genes were retrieved from UCSC
build hg18. PG4 motif sequences were identified as
described earlier and mapped to validated SNPs using
an in-house developed program. Validated SNPs were
extracted from dbSNP (as per criteria: 'by-frequency',
'byCluster', 'by2Hit2Allele', 'byOtherPop' or by 1000
genomes). A random set of short regions for control
analysis was made by extracting sequences of 15-33 nt
from the same region that was used for extracting PG4
motifs, that is, within 1 kb of the TSS. To determine
whether a Quad-SNP was found in the stem or loop of a PG4
motif, we defined a PG4 motif stem position as any G
residue which was: (i) flanked on both sides by G
residues; (ii) preceded by at least two G residues; or
(iii) succeeded by at least two G residues. All other
bases, including G residues, were considered as loops.
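
Read this way, the three criteria amount to asking whether a G lies inside a run of at least three consecutive Gs. A minimal Python sketch of that reading follows; the function names are hypothetical, and the motif is assumed to be written on the G-rich strand (a motif found on the complementary strand would first need to be reverse-complemented).

```python
def is_stem(motif, i):
    """True if motif[i] is a G that satisfies one of the three stem criteria."""
    if motif[i] != "G":
        return False
    # (i) flanked on both sides by G residues
    flanked = 0 < i < len(motif) - 1 and motif[i - 1] == "G" and motif[i + 1] == "G"
    # (ii) preceded by at least two G residues
    preceded = i >= 2 and motif[i - 2:i] == "GG"
    # (iii) succeeded by at least two G residues
    succeeded = motif[i + 1:i + 3] == "GG"
    return flanked or preceded or succeeded

def stem_loop_map(motif):
    """Label every position of a PG4 motif as stem ('S') or loop ('L')."""
    motif = motif.upper()
    return "".join("S" if is_stem(motif, i) else "L" for i in range(len(motif)))

print(stem_loop_map("GGGTGGGTGGGTGGG"))    # every G is in a stem, every T in a loop
print(stem_loop_map("GGGAGAGGGTGGGTGGG"))  # the isolated G in the first loop is 'L'
```

A Quad-SNP is then classified by looking up the label at its offset within the motif.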

Allele frequency data retrieval 

Out of 1184 Quad-SNPs (vide infra), allele frequency data
[HapMart (34)] available for 356 Quad-SNPs in at least
one population were used; 271 Quad-SNPs were found to
be major alleles across all four populations. Of these,
ancestral allele information [UCSC (hg18)] was available
for 237, which were finally used for the comparative
analysis. Population-wise derived allele frequency (DAF)
analysis was done considering all the SNPs for which both
allele frequency information and ancestral allele
information were available: 291 (CEU), 279 (YRI), 282
(CHB) and 281 (JPT) Quad-SNPs. To confirm that the
ancestral allele was invariant, we compared all human
Quad-SNPs to their corresponding chimpanzee SNP (SNP125,
http://hgdownload.cse.ucsc.edu/goldenPath/panTro1/database/snp125.txt.gz);
human coordinates were converted to chimp (panTro1 or 2)
using LiftOver from UCSC. For all the 237 human Quad-SNPs
used for DAF analysis, the corresponding chimpanzee
position was found to be invariant. As a further test, we
used the Genomic Evolutionary Rate Profiling (GERP) score
(37), which was downloaded from the conservation track of
the UCSC table browser.
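
For readers who want to reproduce the DAF bookkeeping, a minimal Python sketch is given below. It assumes per-population allele frequencies and a chimpanzee-derived ancestral allele are already in hand; the function names are hypothetical, and the category boundaries (0-0.1, >0.1-0.5, >0.5) are those used for Figure 2C.

```python
def derived_allele_frequency(allele_freqs, ancestral):
    """Summed frequency of all alleles that differ from the ancestral allele.

    allele_freqs: dict mapping allele -> frequency within one population
    ancestral:    allele observed in the chimpanzee genome
    """
    if ancestral not in allele_freqs:
        raise ValueError("ancestral allele is not among the observed alleles")
    return sum(freq for allele, freq in allele_freqs.items() if allele != ancestral)

def daf_category(daf):
    """Bin a DAF value into the low / moderate / high categories."""
    if daf <= 0.1:
        return "low"
    if daf <= 0.5:
        return "moderate"
    return "high"

def is_flipped(allele_freqs, ancestral):
    """True when the major human allele differs from the ancestral allele."""
    major = max(allele_freqs, key=allele_freqs.get)
    return major != ancestral

# Hypothetical frequencies for two SNPs in one population.
maintained = {"G": 0.95, "C": 0.05}
flipped = {"G": 0.30, "C": 0.70}
for freqs in (maintained, flipped):
    daf = derived_allele_frequency(freqs, "G")
    print(round(daf, 2), daf_category(daf), is_flipped(freqs, "G"))
```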



CD and melting experiments 

CD spectra were recorded using a JASCO-810 instrument.
Oligonucleotides were diluted to 3 µM final concentration
in sodium cacodylate buffer (with 100 mM K+) prior to
experiments, heated to 95°C and gradually cooled to
ambient temperature overnight. CD scans were taken in
a wavelength range of 220-320 nm at 20°C and a scanning
speed of 200 nm/min. For each oligo, three scans were
taken and the spectrum of the buffer was subtracted.
These samples were further used for melting experiments
by first heating to a temperature of 95°C for 10 min and
then slowly cooling to 25°C at a rate of 1°C/min. UV
absorbance was measured at 295 nm.

Cell lines and culture conditions 

Human fibrosarcoma HT1080 and lung adenocarcinoma
A549 cell lines were obtained from the National Centre
for Cell Science, Pune, and were maintained in MEM
(HT1080) or DMEM supplemented with 10% FBS (A549).

Cloning and reporter assays 

The promoter of the RPS3 gene was cloned in the
promoter-less basic pGL3 vector (Promega) using XhoI and
HindIII sites upstream of the luciferase gene following
PCR amplification from normal genomic DNA using primers
(FP 5'-AGAGCTCGAGAAAGAGAGAGGAAGGAAGGA-3',
RP 5'-AATAAGCTTGACCGACAAATGCTCACAAAC-3'). Clones were
screened and sequenced for verification. Positive clones
were subjected to site-directed mutagenesis using the
QuikChange Site-Directed Mutagenesis Kit (Stratagene) to
get the desired single base change within the PG4
sequence (GGGCGG[G/C]CCCATGGGACCTTCTGGG).
Prior to transfection, 12-well plates
were seeded with 2.5 x 10^ cells to achieve optimum 
confluency. Plasmid (1.5 µg) was transfected per well
using Lipofectamine 2000 (Invitrogen), according to the
manufacturer's protocol. As transfection control, 5 ng of
pGL4.73 was co-transfected. Cells were lysed after 24 h
and the luciferase assay was done using the dual
luciferase assay kit from Promega, according to the
manufacturer's protocol. Renilla counts were used for
normalization. All experiments were done in triplicate.
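
The normalization step can be illustrated with a short Python sketch. It assumes that each well yields one reporter count and one co-transfected control count, and that fold change is expressed relative to a reference construct such as the no-insert clone; the function names and the luminometer counts are hypothetical.

```python
from statistics import mean

def normalized_activity(reporter_counts, control_counts):
    """Mean of per-well reporter/control ratios across replicate wells."""
    if len(reporter_counts) != len(control_counts):
        raise ValueError("each well needs one reporter and one control reading")
    return mean(r / c for r, c in zip(reporter_counts, control_counts))

def fold_change(test_activity, reference_activity):
    """Normalized activity of a construct relative to a reference construct."""
    return test_activity / reference_activity

# Hypothetical triplicate counts (firefly reporter, Renilla control).
intact = normalized_activity([52000, 49000, 51000], [1000, 980, 1020])
no_insert = normalized_activity([2000, 2100, 1900], [1000, 1050, 950])
print(round(fold_change(intact, no_insert), 1))
```

Dividing within each well before averaging keeps well-to-well differences in transfection efficiency from leaking into the fold change.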

Incorporation of the synthetic quadruplex motif 

A synthetic quadruplex motif (GGGTGGGTGGGTGGG)
and the sequence representing the corresponding
disrupted motif (GAGTGAGTGAGTGAG) were cloned at
the BglII site preceding the SV40 promoter upstream of
the luciferase gene (Renilla) in the psiCheck 2 vector
(Promega). The firefly luciferase gene integrated within
psiCheck 2 was used for normalization of transfection
efficiency. Positive clones were confirmed by sequencing.
A 12-well plate was seeded as mentioned earlier and 2 µg
of plasmid was transfected using Lipofectamine 2000
(Invitrogen), according to the manufacturer's protocol.
Cells were lysed after 48 h and the luciferase assay was
done as given in the previous section. All experiments
were done in triplicate.

RESULTS

Presence of SNP within PG4 motifs is linked to altered 
gene expression in individuals 

We hypothesized that the presence of SNPs that disrupt
the stability of PG4 motifs alters gene expression in
individuals. To test this, we analysed SNP data (34)
along with gene expression determined from
lymphoblastoid cells of the respective individuals, for
which all SNPs that significantly correlate with altered
expression of a gene have been reported (35).

We found 54 SNPs lying within the potential quadruplex
motif (Quad-SNP in following text) in 48 genes where
change in genotype significantly correlated with altered
gene expression in at least one population (18 genes
harbouring 19 Quad-SNP were differentially expressed in
all the four populations). This constituted 42 Quad-SNP
in the CHB (Chinese) population and 26, 33 and 41 in YRI
(Yoruba from Ibadan), CEU (Caucasians of European
origin) and JPT (Japanese), respectively. For every
Quad-SNP, a distinct change in expression of the
corresponding gene associated with the genotypes across
individuals was clearly observed and is shown
population-wise in Figure 1A; each row represents a
specific SNP, columns show respective genotypes
(heterozygous in centre column flanked by homozygous).
Difference in expression across the genotypes was
statistically significant as reported earlier
(P < 0.0001 in all cases, see Supplementary Table
S1 for rs-id of SNPs). Interestingly, the heterozygous
genotype always resulted in gene expression that was of
an intermediate level with respect to the two homozygous
groups; this is further illustrated using box plots for
representative Quad-SNPs for genotypes in each population
(Figure 1B, right panel; data for all Quad-SNP is given
in Supplementary Table S1).

Quad-SNP result in disruption of the G-quadruplex motif 

Next, in order to test that the PG4 motifs detected above
adopt the G-quadruplex motif and also to check whether
the Quad-SNP results in altered stability of the motifs,
we randomly selected five sequences (with the SNP either
in stem or loop, Supplementary Table S2).
Oligonucleotides were synthesized with or without the
variation and circular dichroism (CD) experiments were
performed in the presence of K+ ions. As expected, all
the five sequences showed distinct characteristics of the
G-quadruplex motif comprising both parallel (260 nm) and
antiparallel (~290 nm) orientations (Figure 1B, left
panel). Interestingly, we noted in all cases that when a
guanine base was disrupted in the stem of the PG4 motif,
the characteristic peak at 260/290 nm was either
disrupted or the peak height reduced, indicating a
general decrease in stability. Accordingly, the melting
temperature of the disrupted sequence also decreased. In
cases when the Quad-SNP was found within the loop, if the
quadruplex showed reduced stability in the CD signature,
a corresponding decrease was observed in the melting
temperature. We noted one exception, rs11570094
(Figure 1B), where substitution of a guanine in the stem
led to a CD spectrum which suggested increased stability,
though the melting point was very similar.



Most Quad-SNP maintain the chimpanzee allele 
within humans 

Next, using the 54 Quad-SNP found to be significantly
associated with gene expression, we asked whether the
SNPs represented deviation or conservation in an
evolutionary context. The chimpanzee genome was used to
distinguish alleles into ancestral (when similar to
chimpanzee) or derived (when different from chimpanzee)
(38). In the majority of cases (43 of 54), we found the
ancestral allele was commonly present in all the four
HapMap populations; in other words, the derived form was
found to be the minor allele. For 11 Quad-SNP, flipping
was observed, that is, the major allele in human was
different from the chimpanzee sequence. We noted with
interest that only 4 out of the 11 flipped Quad-SNPs were
found within the stem of the quadruplex and therefore
were expected to directly affect stability of the
quadruplex motif, while the remaining seven flipped bases
were present within the loop of the PG4 motif and
therefore were not expected to significantly affect
quadruplex stability.

Given the prevalence of PG4 motifs near promoters in 
addition to multiple studies showing role of the 
quadruplex motif in gene expression (12,13,17,19,20) we 
next analysed the region within 1 kb of transcription start 
sites (TSS) in 18 056 unique human promoters. We found 
1184 validated Quad-SNP (see 'Methods' section) in this 
region (note: only 54 Quad-SNP that were significantly 
associated with gene expression as reported in (35) were 
analysed in the previous sections). Out of 1184, we first 
used 237 Quad-SNP, where both allele frequency and an- 
cestral allele data were available for all four populations, 
for further study ('Methods' section). Figure 2A depicts 
the genotype frequency of Quad-SNPs in each population 
where each row represents a bar graph showing ancestral/ 
derived allele frequency for a given SNP. In line with
our earlier observation using 54 Quad-SNP, here we found
that in 195 out of 237 (82.2%) the ancestral allele was
the common or major allele, whereas flipping was observed
in only 42 (17.7%). The fraction of ancestral versus
derived alleles in each population further confirmed the
finding that ancestral alleles were mostly maintained
among Quad-SNP (Figure 2B).

PG4 motif stems are maintained by evolutionarily 
restricting destabilizing substitutions 

Since regions constituting the stem of the PG4 motifs are 
known to be relatively more important for structural sta- 
bility we hypothesized that a Quad-SNP that potentially 
disrupts the structure could be under pressure to be 
conserved/promoted depending on the selective advantage 
that the PG4 motif may impart. In order to test this, we 
considered the Quad-SNPs that had low DAF (0-0.1), i.e. 
ones that appear to resist change from the ancestral form. 
Using these we asked whether there was any difference in 
number of SNP occurring within stems compared to loops 
for the population-specific Quad-SNPs [291 (CEU), 279 
(YRI), 282 (CHB) and 281 (JPT)]. Interestingly, in all the 
four populations we found that the numbers of stem-Quad- 
SNP were significantly more than loop-Quad-SNP for the 
DAF category 0-0.1 (Figure 2C, two-tailed Z-test,
P = 0.002, Supplementary Table S3). In contrast, this
difference between stem/loop Quad-SNPs was not
significant in any of the other higher DAF categories.
Together, this suggests the likelihood that stem SNPs
that could potentially disrupt the structure are being
disfavoured in an evolutionary sense.

Figure 2. Promoter G-quadruplex motifs maintain the ancestral (chimpanzee) form. (A and B) Individual allele frequencies of Quad-SNP in the four
HapMap populations: bar graph shows frequency of ancestral (chimpanzee) and derived allele for each SNP within a population (A) along with
respective fractions of Quad-SNP that were either maintained with ancestral as major allele or flipped to the derived allele (B). (C) Categorization of
stem/loop Quad-SNP with low (0-0.1), moderate (>0.1-0.5) or high (>0.5) derived allele frequencies shows stem SNP are significantly
over-represented in the low category.

Next we sought to check the selection constraint metric
GERP (37) for Quad-SNPs. Based on the understanding
that the rate of substitution is reduced when purifying
selection acts on a locus, GERP estimates the
'rejected substitution' or RS score for given SNPs. Since
the RS score is calculated by subtracting the actual
number of substitutions at a site from the number
expected under neutral conditions, sites under selective
constraint attain positive scores (37,39,40). On
analysing the population-specific Quad-SNPs, consistent
with what was noted above, we found stem-Quad-SNPs had a
higher proportion of positive RS scores compared to loop
SNPs (two-tailed t-test, P < 0.001, Supplementary
Figure S1). In addition, for further testing we used the
recently released SNP data from population-scale
sequencing of 1000 human genomes (41). We found 3372
Quad-SNP within 1 kb of 18 056 genes: 2430 and 942 were
within the stem and loop of the G-quadruplex motif,
respectively. Again, in line with our earlier
observations, we found a relatively higher proportion of
stem-Quad-SNPs had positive RS scores (Supplementary
Figure S1, P = 0.001).

Most promoter PG4 motifs are devoid of SNPs 

The above studies indicated a possible bias against the
presence of SNPs within PG4 motifs present in regulatory
regions. This prompted us to ask whether SNPs were
asymmetrically distributed in PG4 motifs, i.e. what
proportion of PG4 motifs had any SNP at all. This was
checked in 18 056 unique RefSeq gene promoters (±1 kb
of TSS), which had 72 263 validated SNPs (~2 SNP/kb).
Out of these, as mentioned earlier, we found 1184 SNPs
within 32 716 PG4 motifs (comprising 820 903 bases,
average 15- to 33-mers) found in this region, resulting
in a density of 1.4 SNP/kb and indicating that PG4 motifs
are depleted in SNPs (P < 0.0001; χ² test), consistent
with a previous study, which used a different
computational program for detecting motifs (33).
Furthermore, we analysed an equivalent number (32 000) of
randomly picked short sequences of similar average GC%
from ±1 kb of TSS; this gave a density of 2.1 SNP/kb
(P < 0.0001; χ² test). Interestingly, out of the 32 716 PG4
motifs only 1113 had any SNP (P = 3.9e~^^'^\ / test), i.e. 
>96% of the promoter PG4 motifs were devoid of any 
polymorphism. In order to check this further, we used the 
recently released SNP data from population-scale 
sequencing of 1000 human genomes (41). In this case, 
we found 3372 Quad-SNP, validated by the 1000
genome project, occurring within 2982 of the 32 716 PG4 
motifs present within 1 kb of 18 056 genes. This again 
showed that only ~9% of promoter PG4 motifs had one 
or more polymorphic sites indicating that the distribution 
of SNP within PG4 motifs was significantly skewed when 
compared to expected distribution {P = 8.2e~"'; test). 
Together, these studies strongly indicated a possible bias 
against nucleotide substitutions that could lead to disrup- 
tion of quadruplex units in the genome. 
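
The density figures quoted in this section can be checked with simple arithmetic, and the depletion can be illustrated with a 2 x 2 chi-square test. The Python sketch below is only one plausible setup: the contingency table (polymorphic versus non-polymorphic bases, inside versus outside PG4 motifs), the assumption of 2 kb of non-overlapping promoter sequence per gene, and the function names are ours and are not taken from the study.

```python
import math

def snp_density_per_kb(n_snps, n_bases):
    """Number of SNPs per 1000 bases."""
    return 1000.0 * n_snps / n_bases

def chi_square_2x2(a, b, c, d):
    """Pearson chi-square (1 d.f., no continuity correction) for [[a, b], [c, d]]."""
    n = a + b + c + d
    chi2 = n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))
    p = math.erfc(math.sqrt(chi2 / 2.0))  # survival function of chi-square, 1 d.f.
    return chi2, p

# Counts quoted in the text.
pg4_snps, pg4_bases = 1184, 820_903
promoter_snps = 72_263
promoter_bases = 18_056 * 2_000  # assumption: 2 kb per gene, overlaps ignored

pg4_density = snp_density_per_kb(pg4_snps, pg4_bases)
promoter_density = snp_density_per_kb(promoter_snps, promoter_bases)

# Assumed table: rows = inside / outside PG4 motifs, columns = SNP / non-SNP bases.
chi2, p = chi_square_2x2(
    pg4_snps,
    pg4_bases - pg4_snps,
    promoter_snps - pg4_snps,
    (promoter_bases - pg4_bases) - (promoter_snps - pg4_snps),
)
print(f"PG4: {pg4_density:.2f} SNP/kb, promoters: {promoter_density:.2f} SNP/kb")
print(f"chi2 = {chi2:.1f}, p = {p:.2g}")
```

Under these assumptions the densities agree with the ~1.4 and ~2 SNP/kb reported above; the exact test statistic depends on how the comparison table is defined.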

Quadruplex-disrupting SNP results in significantly altered 
promoter activity 

Next, to test above findings we sought to study a PG4 
motif/SNP combination that was independent of the 
data sets analysed above and asked: (i) whether the 
specific nucleotide substitution resulted in disruption of 
the G-quadruplex structure and (ii) if the disruption 
caused any alteration in expression of the gene. For this 
the SNP (rs17880356, G to C) found in the promoter of
the ribosomal protein S3 (RPS3) (Figure 3A), which plays 
a critical role in initiation of translation, was selected. In 
order to determine whether this sequence adopted the 
quadruplex motif, and if the substitution significantly dis- 
rupted the structure, we first synthesized two oligonucleo- 
tides, S3 A and S3B comprising the PG4 motif representing 
both the alleles of the Quad-SNP found in RPS3, 
where S3A had the G-base while S3B had the substitution
(G to C). G-quadruplex forming potential was determined
using CD spectroscopy: S3A gave a well-formed parallel
quadruplex whereas S3B showed a decrease in peak
height at 260 nm, suggesting loss of structural stability
(Figure 3B). We also found that the Tm of the S3A
G-quadruplex motif was 62.1°C whereas that of the S3B
motif was substantially decreased to 48°C, consistent with
CD results (Figure 3B, right panel).

Figure 3. Quad-SNP affects promoter activity of RPS3. (A) Scheme showing part of RPS3 promoter with sequence of the PG4 motif given in bold;
Quad-SNP is underlined. (B) CD spectra of PG4 motif sequences S3A and S3B, melting temperature (Tm) in right frame. (C) Scheme showing
promoter reporter systems inserted upstream of the firefly luciferase gene. Luciferase reporter activity of reporter clones with either S3A or S3B
relative to no insert clone is shown below; activity in case of S3B in A549 cells was not detectable (asterisks). Experiments were done in triplicate;
Renilla luciferase activity was used to normalize transfection efficiency.

Following this we sought to check the influence of the
PG4 motif, and the substitution, on transcription of
RPS3. To test promoter activity, luciferase reporter
systems were constructed using the 1.5-kb long putative
promoter of human RPS3 harbouring the PG4 motif,
which was cloned upstream of the firefly luciferase gene;
expression of Renilla luciferase was used as control
(Figure 3C). An additional construct was made to
represent S3B after incorporating the specific SNP
(G to C) within the PG4 motif. We first checked promoter
activity in human fibrosarcoma cells. Remarkably,
activity of S3A was found to be substantially high and
decreased by >80-fold on G to C substitution
within the PG4 motif (S3B), supporting gene expression
that is linked to presence of the quadruplex motif.
Considering the extent of difference observed, we sought
to check this in a second cell line. On using the human
adenocarcinoma cell line (A549), we found a >25-fold
increase in expression of S3A relative to the empty
vector. In line with the earlier observation, here also
in the case of S3B expression was very low and could not
be detected. Together, these experiments support our
earlier findings and demonstrate that disruption of a
promoter-PG4 motif could lead to significant change in
gene expression.

Figure 4. Incorporation of the G-quadruplex motif and not sequence
per se induces promoter activity. (A) Scheme showing the constructs
made to insert either a G-quadruplex-forming (G4) or disrupted G4
(disG4) as control sequence upstream of SV40 promoter in a luciferase
reporter vector. (B) CD spectra of oligonucleotide used for G4 motif
and disG4 showing disruption of the quadruplex motif in case of
disG4. (C) Luciferase reporter activity of clones harbouring G4 or
disG4 in human cell lines with respect to the no-insert construct. All
experiments were done in triplicate using Renilla luciferase activity as
transfection control.

Insertion of synthetic G-quadruplex motif affects promoter 
activity of reporter construct inside cells 

To test quadruplex-mediated transcription in a more
direct fashion, we made a synthetic G-quadruplex motif
and incorporated this upstream of an exogenous promoter
reporter system constituting the SV40 promoter upstream
of the firefly luciferase gene (Figure 4A). An analogous
system was made by introducing a similar sequence
wherein the quadruplex motif was disrupted by specific
nucleotide changes to constitute a negative control that
did not adopt the quadruplex form. We confirmed that
the substitutions led to disruption of the structure using
CD (Figure 4B) and DNA melting experiments (data not
shown). Following this, luciferase activity was checked in
two cell lines and reporter activity from firefly luciferase
was normalized using Renilla luciferase counts to control
for transfection efficiency. Promoter activity increased on
quadruplex insertion by ~1.9- and 2.3-fold in HT1080
and A549 cells, respectively (Figure 4C). In contrast,
reporter activity when the quadruplex motif was disrupted
was similar to the inherent SV40 promoter activity.
Together these results showed that incorporation of the
quadruplex motif results in altered promoter activity due
to the presence of the structural motif, and that this
effect is lost when the structure is specifically disrupted.



DISCUSSION 

Taken together, results reported here show that integrity
of the G-quadruplex secondary structure form is necessary
for transcription. This is supported by multiple lines of
findings demonstrating that any change in the quadruplex
structure influences transcription. We found promoter
quadruplex sequences not only harbour a low number of
polymorphic sites but are mostly devoid of any SNP
that could potentially disrupt the structure. Interestingly,
even in the small fraction of PG4 motifs with SNPs it was
found that nucleotide changes, with respect to
chimpanzee, occurred in a minor percentage of human
populations. Moreover, this resistance to change with
respect to chimpanzee was distinctly noted in SNPs that
could potentially disrupt the quadruplex structure and not
in ones that are expected to have limited effect on
structure (e.g. SNPs within the loop region of the
quadruplex motif), supporting the notion that integrity of
the structure was critical for function. These findings
were further supported by results obtained from exogenous
addition of a synthetic quadruplex motif: reporter gene
expression was noted to be directly influenced by
incorporation of the quadruplex structure, an effect that
was lost when nucleotide substitutions that specifically
compromised the quadruplex secondary architecture were
introduced.

At a genome-wide level, we found many SNPs within
PG4 motifs where the individual genotypes were strongly
correlated to gene expression (Figure 1). It was also
evident from Figure 1 that individuals having a
heterozygous genotype largely had gene expression levels
that were intermediate relative to the corresponding
homozygous genotypes. Thus, Quad-SNPs fitted well in
an additive genetic model of inheritance, where the
allelic change modulates the phenotype in a
dose-dependent manner (42). Though correlative, considered
with other findings, this implicates quadruplex motifs in
a broader sense, suggesting that gene expression of
individuals could be influenced by base changes that either
form or disrupt a secondary DNA structure.

Recent data from population-scale sequencing in the
1000 genomes project (41) give a much enhanced
coverage of SNPs than obtained by HapMap. Indeed,
using this data set also we noted that SNPs occur in
only a small proportion (<10%) of promoter PG4
motifs, in line with our observation from analysis of
HapMap data. Further analysis of association with gene
expression using SNPs from 1000 genomes data would be
interesting. However, this was not possible as gene
expression data of only a limited number of individuals
are publicly available at this time, and many of the
variants, being rare, would require expression data from
a large number of individuals to ascertain association
with significance. Furthermore, we also noted that a
recent study that compared HapMap3 and 1000 genomes
genotypes for eQTLs using the CEU and YRI expression data
sets found similar numbers of eQTLs between the two
projects. Therefore, while resequencing gives many novel
associations, it is possible that most common effects have
been captured with previous genotyping-based approaches
(43). On the other hand, and perhaps more importantly,
in order to test a causal link between G-quadruplex
and Quad-SNP one would still need to resort to
experimental approaches. Keeping this in mind, we focused
on evidence from transcriptional results that were caused
by directed base changes that specifically disrupt
G-quadruplex forms in order to build support for the
G-quadruplex-gene expression connection among individuals.

Several of the 54 Quad-SNPs that significantly asso- 
ciated with gene expression across individuals were found 
at a relatively long distance from TSS (Supplementary 
Table S1). Influence on transcription in such instances is 
difficult to reason without direct evidence, though in 
eukaryotes the optimum distance for regulatory control 
varies considerably and many cases of long-range regula- 
tion have been reported (44-46). Moreover, using 
chromatin immunoprecipitation (ChIP) followed by 
sequencing, transcription factor association has been 
noted in regions distant from, and both upstream and down- 
stream of, the TSS (47). Therefore, the likelihood that SNPs 
that are far from, and either upstream or downstream of, the TSS can 
affect transcription cannot be ruled out. 

Throughout this study we have considered loop sizes 
restricted to seven bases, based on earlier 
reports (12,48), whereas more recent findings suggest that 
loops of stable quadruplexes can be 10 bases or more 
(49,50). Therefore, the PG4 motifs and SNPs 
detected in this study are perhaps a conservative set of 
possibilities. 
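
The effect of the loop-size limit on motif detection can be sketched as follows. This is a minimal illustration and not the search procedure used in this study: it assumes the commonly used pattern of four G-runs separated by three loops, with a minimum G-run length of three (an assumption, since the run length is not restated here), and the function name `find_pg4` and the example sequences are hypothetical.

```python
import re


def find_pg4(seq, max_loop=7, min_g=3):
    """Return (start, end, motif) for non-overlapping putative G4 motifs.

    Pattern: four runs of >= min_g guanines separated by three loops
    of 1..max_loop bases. min_g=3 is an assumed convention.
    """
    g_run = "G{%d,}" % min_g
    loop = "[ACGT]{1,%d}" % max_loop
    pattern = re.compile(g_run + (loop + g_run) * 3)
    return [(m.start(), m.end(), m.group()) for m in pattern.finditer(seq.upper())]


# Hypothetical sequences for illustration only.
short_loops = "GGGAGGGTTGGGCAGGG"           # loops of 1, 2 and 2 bases
long_loop = "GGGACTACTACTGGGAGGGTGGG"       # first loop is 9 bases

hits_short = find_pg4(short_loops)              # detected with the 7-base limit
hits_strict = find_pg4(long_loop, max_loop=7)   # missed with the 7-base limit
hits_relaxed = find_pg4(long_loop, max_loop=10) # detected once loops may reach 10
```

Raising `max_loop` from 7 to 10 recovers the second sequence, which shows why a seven-base limit yields a conservative motif set.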

In an earlier study it was reported that the G-tracts 
critical for stability of the G-quadruplex motif show low 
polymorphism (33). The authors further found that any 
given short G-tract in the human genome had relatively 
low polymorphism irrespective of whether it was part of 
a G-quadruplex structure. These observations suggested a 
role of the G-tract that may not be related to the 
G-quadruplex motif. Our experiments using the RPS3 
promoter and, particularly, the synthetic quadruplex 
reporter system show that deformation of the quadruplex 
motif has a distinct and remarkable effect on promoter 
activity. These experiments show in a relatively direct 
way that nucleotide base changes in integral positions of 
the quadruplex, namely the G-tract, are likely to have 
important functional consequences. On the other hand, 
though CD spectroscopy confirms G-quadruplex structure 
formation by oligonucleotides, it does not completely rule 
out formation of other structural forms. Therefore, a con- 
tribution from non-G-quadruplex secondary structures is 
difficult to fully exclude. Nonetheless, base substitutions in 
our experimental study were designed to perturb 
G-quadruplexes specifically, and the observed changes are 
therefore likely to result from G-quadruplex structure forma- 
tion/deformation. 

In a recent genome-wide study, we found G-quadruplex 
motifs are closely associated with several DNA binding 
proteins in human, chimpanzee, mouse and rat (51). 
G-quadruplex association with SP1, hnRNP A1, MAZ 
and nucleolin has also been noted (27-29,52). Therefore, 
destabilization of G-quadruplex forms is likely to disrupt 
association with these factors, leading to impairment of enhancer/ 
repressor functions. This could be a likely reason for 
the substantial change (in the case of fibrosarcoma cells, 
Figure 3C) noted in transcription, given the single base 
change that was incorporated. Along similar lines, we noted 
that moderate changes in G-quadruplex stability or even 
alteration of bases within potential loop regions (Figure 1), 
at times, resulted in significant change in gene expression 
among individuals. This again suggests the possibility that 
subtle changes in the G-quadruplex form/stability lead to 
relatively pronounced gene expression changes due to 
altered DNA binding of transcription factor(s). 

Another interesting aspect of the findings stems from 
the fact that a distinct difference was noted in the fre- 
quency of polymorphisms within populations with 
respect to their occurrence in stems/loops of the 
G-quadruplex motifs. Stem SNPs maintained a bias 
towards the ancestral form (predominant in low DAF 
category, Figure 2C). This suggests an interesting evolu- 
tionary perspective. It is widely understood that natural 
selection acts to conserve/disrupt functional elements 
within a genome and thereby drives evolution and popu- 
lation differentiation (53). Therefore, if a particular locus 
is not diversifying then the ancestral allele would be 
expected to remain unchanged or conserved, whereas 
any change generally signifies selective advantage. Based 
on this, it is tempting to speculate that perhaps the struc- 
tural form of a quadruplex is being maintained, whereas 
loop SNPs that are largely not expected to affect structure 
are relatively more amenable to change. 

SUPPLEMENTARY DATA 